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Abstract 

We introduce the Pitman Yor Diffusion 
Tree (PYDT) for hierarchical clustering, 
a generalization of the Dirichlct Diffusion 



Tree ( Neal . 2001 ) which removes the restric- 



tion to binary branching structure. The gen- 
erative process is described and shown to re- 
sult in an exchangeable distribution over data 
points. We prove some theoretical proper- 
ties of the model and then present two infer- 
ence methods: a collapsed MCMC sampler 
which allows us to model uncertainty over 
tree structures, and a computationally effi- 
cient greedy Bayesian EM search algorithm. 
Both algorithms use message passing on the 
tree structure. The utility of the model and 
algorithms is demonstrated on synthetic and 
real world data, both continuous and binary. 

Tree structures play an important role in machine 
learning and statistics. Learning a tree structure over 
data points gives a straightforward picture of how ob- 
jects of interest are related. Trees are easily inter- 
preted and intuitive to understand. Sometimes we may 
know that there is a true underlying hierarchy: for ex- 
ample species in the tree of life or duplicates of genes in 
the human genome, known as paralogs. Typical mix- 
ture models, such as Dirichlet Process mixture models, 
have independent parameters for each component. We 
might expect for example that certain clusters are sim- 
ilar, for example are sub-groups of some large group. 
By learning this hierarchical similarity structure, the 
model can share statistical strength between compo- 
nents to make better estimates of parameters using 
less data. 

Classical hierarchical clustering algorithms employ a 
bottom up "agglomerative" approach (Duda et al. 



2001) which hides the statistical assumptions being 



probabilistic model in lieu of a distance metric but 
simply view the hierarchy as a tree consistent mixture 
over partitions of the data. If instead a full genera- 
tive model for both the tree structure and the data is 



used (Williams 2000 Neal 2003b Teh et al. 2008 



Blei et al. 2010) Bayesian inference machinery can be 



used to compute posterior distributions over the tree 
structures themselves. Such models can also be used to 



learn hierarchies over latent variables (Rai and Daume 



III 2008). 



Both heuristic and generative probabilistic approaches 
to learning hierarchies have focused on learning binary 
trees. Although computationally convenient this re- 
striction may be undesirable: where appropriate, ar- 
bitrary trees provide a more interpretable, clean sum- 
mary of the data. Some recent work has aimed to 



address this. Blundell et al. (2010) extend Heller and 



Ghahramani ( 2005 ) by removing the restriction to bi- 



nary trees. However, as for Heller and Ghahramani 



( 2005 ) the lack of a generative process prohibits mod- 



eling uncertainty over tree structures. Williams (2000) 



allows nonbinary trees by having each node indepen- 
dently pick a parent in the layer above, but requires 
one to pre-specify the number of layers and number of 
nodes in each layer. Blei et al. (2010) use the nested 



Chinese restaurant process to define probability dis- 
tributions over tree structures. Each data point is 
drawn from a mixture over the parameters on the path 
from the root to the data point, which is appropri- 
ate for mixed membership models but not standard 
clustering. An alternative to the PYDT to obtain un- 



bounded trees is given by Adams et al. (2010). They 



made. Heller and Ghahramani (2005) use a principled 



use a nested stick-breaking process to construct the 
tree, which is then endowed with a diffusion process. 
Data live at internal nodes of the tree, rather than at 
leaves as in the PYDT. 

We introduce the Pitman Yor Diffusion Tree (PYDT) , 
a generalization of the Dirichlet Diffusion Tree ( |Neal| 
|2001[ ) to trees with arbitrary branching structure. 
While allowing atoms in the divergence function of the 



DDT can in principle be used to obtain multifurcating 
branch points ( Neal 2003b[ ) , our solution is both more 
flexible and more mathematically and computationally 
tractable. An interesting property of the PYDT is 
that the implied distribution over tree structures cor- 
responds to the multifurcating Gibbs fragmentation 



tree (McCullagh et al. 2008), known to be the most 



general process generating exchangeable and consis- 
tent trees (here consistency can be understood as co- 
herence under marginalization of subtrees). 

Our contributions are as follows. In Section [l] we 
describe the generative process corresponding to the 
PYDT. In Section[2]we derive the probability of a tree, 
and in Section[3]show some important properties of the 
process. Section[4]describes our hierarchical clustering 
models utilising the PYDT. In Section[5]we present an 
MCMC sampler and a greedy EM algorithm, which we 
developped for the DDT in |Knowles et al.| ( |201l| . We 
present results demonstrating the utility of the PYDT 
in Section [6j In the supplementary material we de- 
scribe how to sample from the PYDT. 

1 Generative process 

We will describe the generative process for the data 
in terms of a diffusion process in fictitious "time" on 
the unit interval. The observed data points (or la- 
tent variables) correspond to the locations of the dif- 
fusion process at time t = 1. The first datapoint 
starts at time at the origin in a Z3-dimensional Eu- 
clidean space and follows a Brownian motion with 
variance a 1 until time 1. If datapoint 1 is at po- 
sition x\(t) at time t, the point will reach position 
xi(t + dt) ~ N(xi(t), a 2 Idt) at time t + dt. It can 
easily be shown that X\(t) ~ Normal(0, a 2 It). The 
second point x% in the dataset also starts at the origin 
and initially follows the path of x\. The path of xi will 
diverge from that of x\ at some time Tj, after which x^ 
follows a Brownian motion independent of x\(t) until 
t — 1, with Xi{\) being the z-th data point. The prob- 
ability of diverging in an interval [t + dt] is determined 
by a "divergence function" a(t) (see Equation [I] below) 
which is analogous to the hazard function in survival 
analysis. 

The generative process for datapoint i is as follows. 
Initially Xi(t) follows the path of the previous data- 
points. If at time t the path of Xi(t) has not diverged, 
it will diverge in the next infinitesimal time interval 
[t, t + dt] with probability 



a{t)T(m - /3)dt 
T(m + 1 + a) 



(1) 



where m is the number of datapoints that have previ- 
ously followed the current path and < j3 < I, a > 



—2/3 are parameters of the model. In the special case of 
integer a £ N and (3 = the probability of divergence 
reduces to a(t)dt/[m(m + 1) . . . (m + a — l)(m + a)]. 
For example for a = 1 this gives a(t)dt/[m(m + 1)], 
and for a — (3 = the DDT expression a(t)dt/m is 
recovered. If Xi does not diverge before reaching a 
previous branching point, it may either follow one of 
the previous branches, or diverge at the branch point 
(adding one to the degree of this node in the tree). The 
probability of following one of the existing branches k 
is 



- a 



(2) 



where bk is the number of samples which previously 
took branch k and m is the total number of samples 
through this branch point so far. The probability of di- 
verging at the branch point and creating a new branch 



a + (3K 



(3) 



where K is the number of branches from this branch 
point. By summing Equation [2] over k — {1, . . . , K} 
with Equation [3] we get 1 as required. This rein- 
forcement scheme is analogous to the Pitman Yor pro- 
cess (Teh 2006[) ver sion of the Chinese restaurant pro- 
cess (Aldous, 1985). For the single data point Xi(t) 
this process is iterated down the tree until divergence, 
after which Xi(t) performs independent Brownian mo- 
tion until time t = 1. The i-th observed data point is 
given by the location of this Brownian motion at t = 1, 
i.e. Xi(l). 

2 Probability of a tree 

We refer to branch points and leaves of the tree as 
nodes. The probability of generating a specific tree 
structure with associated divergence times and loca- 
tions at each node can be written analytically since 
the specific diffusion path taken between nodes can 
be ignored. We will need the probability that a new 
data point does not diverge between times s < t on 
a branch that has been followed m times by previ- 
ous data-points. This can straightforwardly be derived 
from Equation [TJ 



not diverging 
in [s, t] 



exp 



(A(s)-A(t)) 



T(m- P) 
r(m + 1 + a) 

(4) * 



where A(t) — j Q a(u)du is the cumulative rate func- 
tion. 

Consider the tree of N = 4 data points in Figure [l] 
The probability of obtaining this tree structure and 



associated divergence times is: 
T{2 + a) 

2 + a T(2 + a) 
-Aft ) r(3-«) a + 2/3 

3 + a 

The first data point does not contribute to the expres- 
sion. The second point contributes the first line: the 
first term results from not diverging between t — and 
t a , the second from diverging at t a . The third point 
contributes the second line: the first term comes from 
not diverging before time t a , the second from choosing 
the branch leading towards the first point, the third 
term comes from not diverging between times t a and 
and the final term from diverging at time t b . The 
fourth and final data point contributes the final line: 
the first term for not diverging before time t a and the 
second term for diverging at branch point a. 

The component resulting from the divergence and data 
locations for the tree in Figure [T] is 

N{x 1 ;Q,a 2 )N{x 2 ;x a ,a 2 {l-t a )) 
xN(x 3 ; x b , cr 2 (l - t b ))N(x 4 ; x a , a 2 (l - t a )) 

where each data point has contributed a term. We can 
rewrite this as: 

N{x a ;0,a 2 t a )N(x b ; x a , cr (t b - t a )) 
xN(x x ;x b , cr 2 (l - t b )) x N(x 2 ; x a ,a 2 {\ - t a )) 
xN{x 3 ;x b , <7 2 (1 - t b ))N(x 4 ; x a , a 2 (l - Q) (5) 

to see that there is a Gaussian term associated with 
each branch in the tree. 

3 Theory 

Now we present some important properties of the 
PYDT generative process. 

Lemma 1. The probability of generating a specific tree 
structure, divergence times, divergence locations and 
corresponding data set is invariant to the ordering of 
data points. 

Proof. The probability of a draw from the PYDT can 
be decomposed into three components: the probabil- 
ity of the underlying tree structure, the probability 
of the divergence times given the tree structure, and 
the probability of the divergence locations given the 
divergence times. We will show that none of these 
components depend on the ordering of the data. Con- 
sider the tree T as a set of edges S(T) each of which 




Figure 1: A sample from the Pitman- Yor Diffusion 
Tree with N — 4 datapoints and a(t) = 1/(1 — t), a = 
l,/3 = 0. Top: the location of the Brownian motion 
for each of the four paths. Bottom: the corresponding 
tree structure. Each branch point corresponds to an 
internal tree node. 

we will see contributes to the joint probability den- 
sity. The tree structure T contains the counts of 
how many datapoints traversed each edge. We de- 
note an edge by [ab] £ S(T), which goes from node 
a to node b with corresponding locations x a and x b 
and divergence times t a and t b . Let m(b) be the num- 
ber of samples to have passed through b. Denote by 
S'(T) = {[ab] € S(T) : m{b) > 2} the set of all edges 
traversed by m > 2 samples (for divergence functions 
which ensure divergence before time 1 this is the set 
of all edges not connecting to leaf nodes). 

Probability of the tree structure. For segment [ab], let 
i be the index of the sample which diverged to create 
the branch point at b, thereby contributing a factor 

V{i + a) 

Let the number of branches from b be K b , and the 
number of samples which followed each branch be 
{n\ : k e [1 . . . K b ]}. The total number of datapoints 
which traversed edge [ab] is mib) = ^ can 

be shown (see Appendix [A} that the factor associated 
with this branching structure for the data points after 
i is 

n£ 3 [<» + (fc - mni + <*) nS r K - g) 

r(i - 1 + (3)T(m(b) + a) 
Multiplying by the contribution from data point i in 



Equation [6] we have 

<M n§> + (fc - m n£i n< ~ g) 

r(m(6) + a) 



(7) 



Each segment [a&] € S'(T) contributes such a term. 
Since this expression does not depend on the ordering 
of the branching events, the overall factor does not 
either. 

Probability of divergence times. The m(b) — 1 points 
that followed the first point along this path did not 
diverge before time % (otherwise [ab] would not be 
an edge) , which from Equation [4] we see contributes a 
factor 



i(b)-i 

n exp 



= exp 



(A(t a ) - A{t b )) 



r(j - P) 

T{i + l + o 



[{A{t a )-A{t b ))H^ b) _ x 



(8) 

where we define H"' 13 — Y%=i r§+T+«j • ^ n edges 
[ab] £ <5'(T) contribute the expression in Equation [8] 
resulting in a total contribution 



n ex p 

[ab]eS'(T) 



(AiQ-A^H^ 



(9) 



This expression does not depend on the ordering of the 
datapoints. 

Probability of node locations. Generalizing Equation[5] 
it is clear that each edge contributes a Gaussian factor, 
resulting an overall factor: 



Y[ N(x b ;x a ,a 2 (t b -t a )I) 

[ab]eS{T) 



(10) 



The overall probability of a specific tree, divergence 
times and node locations is given by the product of 
Equations [7J [9] and 10 none of which depend on the 
ordering of the data. □ 

The term Flit^aC + (k — l)fj] in Equation [7] can be 
calculated efficiently depending on the value of (3. For 



f3 = we have rifc=3 a = 



„K-2 



For ^/Owe have 



fj [a + (k - l)f3] = (i K »- 2 TJ [a/0 +(k- 1)] 



k=3 



k=3 



= l3 K >- 2 T(a/f3 + K b ) 
T(a/p + 2) 

Theorem 1. The Pitman- Yor Diffusion Tree de- 
fines an infinitely exchangeable distribution over data 
points. 



Proof. Summing over all possible tree structures, and 
integrating over all branch point times and locations, 
by Lemma [T] we have infinite exchangeability. □ 

Corollary 1. There exists a prior v on probability 
measures on M. D such that the samples x\, X2, ■ ■ ■ gen- 
erated by a PYDT are conditionally independent and 
identically distributed (iid) according to T ~ V , that 
is, we can represent the PYDT as 



PYDT( Xl ,x 2 ,...) 




dv(F) 



Proof. Since the PYDT defines an infinitely exchange- 
able process on data points, the result follows di- 



rectly by de Finetti's Theorem (Hewitt and Savage 



1955) 



□ 



Another way of expressing Corollary 1 is that data 
points xi,...,xn sampled from the PYDT could 
equivalently have been sampled by first sampling a 
probability measure J- ~ v, then sampling Xi ~ T 
iid for all i in {1, ...,N}. For divergence functions 
such that A(l) is infinite, the probability measure T 
is continuous almost surely. 

Lemma 2. The PYDT reduces to the Diffusion 
Dirichlet Tree (Neal 2001) in the case a — f3 



0. 



Proof. This is clear from the generative process: for 
a = /3 = there is zero probability of branching at 
a previous branch point (assuming continuous cumu- 
lative divergence function A(t)). The probability of 
diverging in the time interval [t, t + dt] from a branch 
previously traversed by m datapoints becomes: 



a(t)T(m - Q)dt _ a(t)(m - l)\dt _ a(t)dt 
r(m +1 + 0) m! m 

as for the DDT. 



(11) 

□ 



It is straightforward to confirm that the DDT proba- 
bility factors are recovered when a — j3 — 0. In this 
case K = 2 since non-binary branch points have zero 
probability, so Equation [7] reduces as follows: 



<h) Ul'i 2 rK - °) - i)K&2 - 1)! 



T(m(b) +0) 



(m(&) - 1)! 



as for the DDT. Equation [9] also reduces to the DDT 
expression since 

ff o.o_y r(i-O) " (z-l)! _» 1_ 

n "^r^ + i + o) _ ^ i\ -2^,-^ 

where H n is the n-th Harmonic number. 




Figure 2: The effect of varying a on the log probabil- 
ity of two tree structures, indicating the types of tree 
preferred. Small a < 1 favors binary trees while larger 
values of a favors higher order branching points. 

For the purpose of this paper we use the divergence 
function a(t) — with "smoothness" parameter 

c > 0. Larger values c > 1 give smoother densities be- 
cause divergences typically occur earlier, resulting in 
less dependence between the datapoints. Smaller val- 
ues c < 1 give rougher more "clumpy" densities with 
more local structure since divergence typically occurs 
later, closer to t = 1. For this divergence function we 
have A(t) = -clog (1 - t). 

Equation [9] factorizes into a term for t a and %■ Col- 
lecting such terms from the branches attached to an 
internal node b the factor for for the divergence func- 
tion a(t) = c/(l — t) is 



P(t b \T)=a(t b )exp 



\k=l 



= c(l-t&) 



cJ a 



where J 



a,f) 



H. 



a, 



T K 



H 



a,0 



(12) 
with n b e 



This generalization of the DDT allows non-binary tree 
structures to be learnt. By varying a we can move be- 
tween flat (large a) and hierarchical clusterings (small 
a), as shown in Figure [2] 

4 Model 

To complete the model we must specify a likelihood 
function for the data given the leaf locations of the 
PYDT, and priors on the hyperparameters. We use a 
Gaussian observation model for multivariate continu- 
ous data and a probit model for binary vectors. We 
specify the following priors on the hyperparameters: 



a ~ G(a a , b a ) 
c — G(a c , b c ) 



/? ~ Beta(a/3, bp) 
I /a 2 ~ G(o, 
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Figure 3: A sample from the Pitman- Yor Diffusion 
Tree with N = 20 datapoints and a(t) = 1/(1 — t), a = 
l,/3 = showing the branching structure including 
non-binary branch points. 



(a) c = 1, a = 0,j3 = (b) c = 1, a = 0.5, P = 
(DDT) 



(c) c= 1,q = l,/3 = (d) c = 3, a = 1.5,/? = 

Figure 4: Samples from the Pitman- Yor Diffusion Tree 
with N = 1000 datapoints in D = 2 dimensions and 
a(t) = c/(l — i). As a increases more obvious clusters 
appear. 



where G(a, b) is a Gamma distribution with shape a 
and rate b. In all experiments we used a a — 2, b a = 
.5, ap = 1, bp = 1, a c = l,b c = 1, a CT 2 = 1, b a 2 = 1. 



5 Inference 

We propose two inference algorithms: an MCMC sam- 
pler and a more computationally efficient greedy EM 
algorithm. Both algorithms marginalize out the loca- 
tions of internal nodes using belief propagation, and 
are capable of learning the hyperparameters c, a 2 , a 
and 8 if desired. 



5.1 MCMC sampler 

We construct an MCMC sampler to explore the pos- 
terior over the tree structure, divergence times and 
hyperparameters. To sample the structure and diver- 
gence times a subtree is chosen uniformly at random 
to be detached (the subtree may be a single leaf node) . 
To propose a new position in the tree for the subtree, 
we follow the procedure for generating a new sample 
on the remaining tree. The subtree is attached wher- 
ever divergence occurred, which may be on a segment, 
in which case a new parent node is created, or at an ex- 
isting internal node, in which case the subtree becomes 
a child of that node. If divergence occurred at a time 
later than the divergence time of the root of the sub- 
tree we must repeat the procedure until this is not the 
case. The marginal likelihood of the new tree is calcu- 
lated, marginalizing over the internal node locations, 
and excluding the structure and divergence time con- 
tribution since this is accounted for by having sampled 
the new location according to the prior. The ratio to 
the marginal likelihood for the original tree gives the 
Metropolis factor used to determine whether this move 
is accepted. Unfortunately it is not possible to slice 



sample the position of the subtree as in Neal ( 2003b ) 



because of the atoms in the prior at each branch point. 

Smoothness hyperparameter c. From Equa- 
tion [12] the Gibbs conditional for c is 



G U c + |Z|,& c + ^J^log(l~^) 
V iei / 



where 2 is the set of internal nodes of the tree. 



(13) 



Data variance a 2 . It is straightforward to sample 
1/cr 2 given divergence locations. Having performed 
belief propagation it is easy to jointly sample the di- 
vergence locations using a pass of backwards sampling. 
From Equation [l0| the Gibbs conditional for the preci- 



sion 1/cr 2 is then 



G(a CT 2,6 ff2 ) \\ G(£)/2 + l, 



[ab]eS(T) 

where II • II denotes Euclidean distance. 



2{t b -t a ) 



(14) 



Pitman- Yor hyperparameters a, 8. We use slice 



sampling (Neal 2003a[ ) to sample a and 8. We repa- 
rameterize in terms of the logarithm of a and the logit 
of 8 to extend the domain to the whole real line. The 
terms required to calculate the conditional probability 
are those in Equations [7] and [9j 

5.2 Greedy Bayesian EM algorithm 

As an alternative to MCMC here we use a Bayesian 
EM algorithm to approximate the marginal likelihood 
for a given tree structure, which is then used to drive a 
greedy search over tree structures, following our work 
in 



Knowles et al. (2011). 



EM algorithm. In the E-step, we use message pass- 
ing to integrate over the locations and hyperparame- 
ters. In the M-step we maximize the lower bound on 
the marginal likelihood with respect to the divergence 
times. For each node i with divergence time U we have 
the constraints t p < ti < min (ti,t r ) where ti,t r ,t p are 
the divergence times of the left child, right child and 
parent of i respectively. 

We jointly optimize the divergence times using 
LBFGS ( TLiuand Nocedal[|1989| ). Since the divergence 
times must lie within [0, 1] we use the reparameteri- 
zation Sj = log [ti/(l — t^] to extend the domain to 
the real line, which we find improves empirical perfor- 
mance. From Equations 10 and 12 the lower bound 



on the log evidence is a sum over all branches [pi] of 
expressions of the form: 



(( C ) rj - 1) log (i - to - § log ( ti - t p ) {±) 



'[P<] 



(15) 



where b[ pi ] = \ J2d=i ^[( x di~x dp ) 2 ], x d i is the location 
of node i in dimension d, and p is the parent of node 
i. The full lower bound is the sum of such terms over 
all nodes. The expectation required for is readily 
calculated from the marginals of the locations after 
message passing. Differentiating to obtain the gradient 
with respect to ti is straightforward so we omit the 
details. Although this is a constrained optimization 
problem (branch lengths cannot be negative) it is not 
necessary to use the log barrier method because the 
l/(ti — t p ) terms in the objective implicitly enforce the 
constraints. 



Hyperparameters. We use variational inference to 
learn Gamma posteriors on the inverse data variance 
1 /a 2 and smoothness c. The variational updates for c 
and l/cr 2 are the same as the conditional Gibbs dis- 
tributions in Equations [13] and [14] respectively. We 
optimize a and /3 by coordinate descent using golden 
section search on the terms in Equations [7] and [9] 



Search over tree structures The EM algorithm 
approximates the marginal likelihood for a fixed tree 
structure T. We maintain a list of A"-best trees (typ- 
ically K — 10) which we find gives good empirical 
performance. Similarly to the sampler, we search the 
space of tree structures by detaching and re-attaching 
subtrees. We choose which subtree to detach at ran- 
dom. We can significantly improve on re-attaching at 
random by calculating the local contribution to the 
evidence that would be made by attaching the root of 
the subtree to the midpoint of each possible branch 
and at each possible branch point. We then run EM 
on just the three best resulting trees. We found con- 
struction of the initial tree by sequential attachment 
of the data points using this method to give very good 
initializations. 



5.3 Predictive distribution 

To calculate the predictive distribution for a specific 
tree we compute the distribution for a new data point 
conditioned on the posterior location and divergence 
time marginals. Firstly we calculate the probability 
of diverging from each branch according to the data 
generating process described in Section [I] Secondly 
we draw several (typically three) samples of when di- 
vergence from each branch occurs. Finally we calcu- 
late the Gaussian at the leaves resulting from Brow- 
nian motion starting at the sampled divergence time 
and location up to to t = 1. This results in a predic- 
tive distribution represented as a weighted mixture of 
Gaussians. Finally we average the density from a num- 
ber of samples from the sampler or the K -best trees 
found by the EM search algorithm. 



5.4 Likelihood models 

Connecting our PYDT module to different likelihood 
models is straightforward: we use a Gaussian observa- 
tion model and a probit model for binary vectors. The 
MCMC algorithm slice samples auxiliary variables and 



the EM algorithm uses EP (Minka 2001) on the pro- 
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Figure 5: Optimal trees learnt by the greedy EM algo- 
rithm for the DDT and PYDT on a synethic dataset 
with D = 2, N = 100. 



6 Results 

We present results on synthetic and real world data, 
both continuous and binary. 

6.1 Synthetic data 

We first compare the PYDT to the DDT on a simple 
synthetic dataset with D = 2, N = 100, sampled from 
the density 

f(x,y) = ± J2 E N(x;x,l/8)N(y;y,l/8) 
sel-i,i] ye[-i,i] 

The optimal trees learnt by 100 iterations of the greedy 
EM algorithm are shown in Figure [5] While the DDT 
is forced to arbitrarily choose a binary branching struc- 
ture over the four equi-distant clusters, the PYDT is 
able to represent the more parsimonious solution that 
the four clusters are equally dependent. Both mod- 
els find the fine detail of the individual cluster sam- 
ples which may be undesirable; investigating whether 
learning a noise model for the observations alleviates 
this problem is a subject of future work. 

6.2 Density modeling 



In Adams et al. (2008) the DDT was shown to be 
an excellent density model on a D = 10, N — 228 
dataset of macaque skull measurements, outperform- 
ing a kernel density and infinite mixture of Gaussians, 
and sometimes the Gaussian process density sampler 
itself. We compare the PYDT to the DDT on the 
same dataset, using the same data preprocessing and 
same three train test splits (iVtrain = 200, N tos t = 28) 
as |Adams et al.| (120081) . The performance using the 



bit factor, implemented using the runtime component 



of the Infer.NET framework Minka et al. (2010). 



MCMC sampler is shown in Figure [Bj The PYDT 
finds trees with higher marginal likelihood than the 
DDT, which corresponds to a moderate improvement 
in predictive performance. Inference in the PYDT is 
actually slightly more efficient computationally than 
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Figure 6: Density modeling of the D = 10, N = 200 
macaque skull measurement dataset of |Adams et al.] 
(2008). Top: Improvement in test predictive likeli- 
hood compared to a kernel density estimate. Bottom: 
Marginal likelihood of current tree. The shared x-axis 
is computation time in seconds. 

in the DDT because the on average smaller number of 
internal nodes reduces the cost of belief propagation 
over the divergence locations, which is the bottleneck 
of the algorithm. 

6.3 Binary example 

To demonstrate the use of an alternative observation 
model we use a probit observation model in each di- 
mension to model 102-dimensional binary feature vec- 
tors relating to attributes (e.g. being warm-blooded, 
having two legs) of 33 animal species from Kemp and 
Tenenbaum] ( |2008| ). The MAP tree structure learnt 
using EM, is shown in Figure [7| is intuitive, with sub- 
trees corresponding to land mammals, aquatic mam- 
mals, reptiles, birds, and insects (shown by colour cod- 
ing). Note that penguins cluster with aquatic species 
rather than birds, which is not surprising since there 
are attributres such as "swims" , "flies" and "lives in 
water" . 

7 Conclusion 

We have introduced the Pitman- Yor Diffusion Tree, a 
Bayesian nonparametric prior over tree structures with 
arbitrary branching structure at each branch point. 
We have shown the PYDT defines an infinitely ex- 
changeable distribution over data points. We demon- 
strated an MCMC sampler and Bayesian EM with 
greedy search, both using message passing on the tree 
structure. More advanced MCMC methods could be 
of use here. Quantitatively we have shown a modest 
improvement relative to the DDT on a density estima- 
tion task. However, we see improved interpretability 
as the key benefit of removing the restriction to binary 
trees, especially since hierarchical clustering is typi- 
cally used as a data exploration tool. Qualitatively, 
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Figure 7: Tree structure learnt for the animals dataset 



of Kemp and Tenenbaum ( 2008 ) 



we have shown the PYDT can find simpler, more in- 
terpretable representations of data than the DDT. To 
encourage the use of the PYDT by the community we 
will make our code publicly available. 

In ongoing work we use the PYDT to learn hierarchi- 
cal structure over latent variables in models including 
Hidden Markov Models, specifically in part of speech 



tagging (JKupiec 1992) where a hierarchy over the la- 
tent states aids interpretability, and Latent Dirichlct 
Allocation, where it is intuitive that topics might be 
hierarchically clustered ( jBlei et ah 2004). Another 



interesting direction would be to use the prior over 
branching structures implied by the PYDT in the an- 



notated hierarchies model of Roy et al. (2007) 



A Probability of tree structure 

For segment [ab] , recall that i is the index of the sam- 
ple which created the branch point at b. Thus i — 1 
samples did not diverge at b so do not contribute any 
terms. Let the final number of branches from b be Kf,, 
and the number of samples which followed each branch 
be n fe := {n\ : k G [1 . . . K b }}. The probability of the 
i-th sample having diverged to form the branch point 
j g a(t b )r(i-i-ff) ^ n ow we w i sn t calculate the proba- 
bility of thefinal branching structure at b. Following 
the divergence of sample i there are Kb— 2 samples who 



form new branches from the same point, contributing 
a + (k— to the numerator for k £ {3, . . . , Kb}. Let 
c; be the number of samples having previously followed 
path I, so that q ranges from 1 to n\ — 1 (apart from 
Ci which only ranges from i — 1 to n\ — 1). The j-th 
sample contributes a factor j — 1 + a to the denomina- 
tor. The total number of datapoints which traversed 
edge [ab] is m(b) = Y^j=i n \- The factor associated 
with this branch point is then: 

n£ 3 [« + (k m nfc-ifo - g) n£ 2 nfc 1 ^ - g) 

n22-iC?-i+«) 
= ng> + (fc - nfc 1 ^ - g) 
nSliCj-i+ajnU^-/ 3 ) 

= Q§> + (fc - Iffii + a) ngt r(n? - g) 
r(m(6) + a)r(i - 1 + /3) 
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